Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT) by xiehuanyi · Pull Request #1528 · openai/parameter-golf

xiehuanyi · 2026-04-10T18:42:21Z

Summary

UPDATED 2026-04-11: Replaces the earlier 1.1104 BPB result with a much stronger 1.07035 BPB (TTT) / 1.07266 (sliding) using the exact PR #1493 SOTA recipe (SP8192 + 3-layer recurrence + parallel residuals + QK-Gain 5.25 + MuonEq-R + SDClip GPTQ + Brotli + legal score-first TTT), adapted for 1×A100 instead of 8×H100.

Beats upstream main-leaderboard SOTA:

PR Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean) #1493 sliding: 1.0827 → this: 1.07266 (−0.01004)
PR Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean) #1493 TTT: 1.0810 → this: 1.07035 (−0.01065)

Still non-record because the run was on 1×A100 for 4h (≈80 H100-minute-equivalent of raw BF16 throughput, but not on the required hardware and without FA3).

What's in this PR

The training script is the decompressed PR #1493 train_gpt.py (their LZMA+base85 one-liner) with three minimal adaptations for Ampere + Python 3.9:

FA3 → FA2/SDP fallback. A100 doesn't support FlashAttention-3. The attention wrapper now tries flash_attn (FA2) first, then falls through to PyTorch scaled_dot_product_attention with the flash backend. The SDP path adds a manual GQA head-repeat (PyTorch SDP doesn't natively support num_heads != num_kv_heads).
Python 3.9 compat. Removed zip(strict=True) and nested double-quoted f-strings.
GRAD_ACCUM_STEPS env override. Added so single-GPU runs can override the default 8 // world_size. Not actually used in this submission (defaults kept), but left in for flexibility.

Everything else is identical to PR #1493: SP8192 vocab, 11L×512d×8H/4KV, MLP 4x, depth recurrence looping layers 3-5 (17 virtual from 11 physical, activated at frac=0.35), parallel residuals layer 7+, QK-Gain 5.25, skip gates, MuonEq-R + AdamW, WD=0.095, EMA=0.9965, warmdown_frac=0.72, matrix_lr=0.022, GPTQ SDClip (k=12.85 mats / k=20.0 embs), int6 attn+mlp / int8 embs, Brotli-11 + byte shuffle, legal score-first TTT (SGD lr=0.005 mom=0.9, 3 epochs/32K chunk).

Numbers (seed 1337)

Metric	Value
Pre-quant post-EMA	1.07610
Int6 quantized	1.08950
Int6 + sliding window (s=64)	1.07266
Int6 + sliding + legal TTT	1.07035
Steps trained	6371 / 20000 (wallclock capped at 4h)
Peak GPU memory	41.8 GiB / 80 GiB (A100)
Model params	35,944,536
Artifact bytes	15,970,123
Total submission	16,019,227 (under 16 MiB)

Hardware equivalence

Main leaderboard budget: 8 × H100 × 10 min = 80 H100-minute-equivalent
This submission: 1 × A100 × 240 min = ~76–80 H100-minute-equivalent
(H100 BF16 ≈ 3.17× A100 BF16, plus FA3 is Hopper-only so there's an additional ~1.5× gap we don't close)

Comparison with exp60 / exp61 (same training config, different QK_gain)

Three runs of the same config differing only in QK_GAIN_INIT:

Run	QK_GAIN	Quant	Sliding	TTT
exp60	5.0	1.09031	1.07345	1.07137
exp61	5.0	1.09031	1.07345	1.07137
exp62	5.25	1.08950	1.07266	1.07035

The SOTA record's non-default QK_GAIN_INIT=5.25 consistently helps all three quant/eval phases, confirming the paper's "monotonic improvement from 4.0 to 5.25" observation.

Caveats

Single seed (1337). A 3-seed mean has not been run for time reasons.
exp60 and exp62 both crashed with SIGSEGV at the end of their own eval pipelines (torch.compile recompile issue when creating a fresh GPT instance for eval after training). The saved quantized artifacts were then evaluated successfully via a standalone eval_only.py script. exp61 completed its full eval pipeline natively.
grad_accum=2 variant (exp63/64) OOM'd at startup: SOTA model with MLP 4× + depth recurrence has a per-micro-batch footprint that doesn't fit on A100 at quarter the accum.

Test plan

Training completes within 4h wallclock cap
Quantization + sliding window eval produces valid int6 artifact under 16 MiB
Sliding window BPB beats upstream SOTA 1.0827 (got 1.07266)
TTT BPB beats upstream SOTA 1.0810 (got 1.07035)
Artifact round-trip through brotli + byte-shuffle without errors
3-seed reproduction (not done)

Files:

README.md with full recipe + numbers + reproduction commands
submission.json with structured metadata
train_gpt.py — A100-adapted SOTA script
final_model.int6.ptz (15.97 MB)
train_seed1337.log + eval_seed1337.log
requirements.txt

Longer-context + longer-training variant of the ValCalib_GPTQ_XSA_BigramHash3072 stack. Moves TRAIN_SEQ_LEN 1024 -> 2048 and runs for 4h on 1x A100 (no H100 available), which together bring sliding-window int6 BPB from 1.1317 (s1024, 2h) down to 1.11044406 (s2048, 4h). Non-record because the submission was trained on 1x A100 for 240 minutes (roughly equivalent to 76-80 H100-minutes, close to the 80 H100-minute official budget) rather than on the required 8xH100 x 10min hardware. Artifact: 15.94 MB int6+lzma, total submission 16.04 MB (under 16 MiB limit). Model: 27M params, 11L 512d 3xMLP, XSA-all, BigramHash(2048), PartialRoPE(16/64), LN Scale, SmearGate, Muon+AdamW WD=0.04, EMA(0.997 deferred), SWA, Late QAT@0.15, Int6 GPTQ with self-generated AR calibration, LZMA preset=9, sliding window eval stride=64. Currently single-seed (1337). Seeds 42 and 999 are running and will be added to submission.json once complete.

Copilot

Pull request overview

Adds a new non-record leaderboard submission under track_non_record_16mb for an 11-layer full-stack model trained at seq_len=2048 for 4h on 1×A100, reporting val_bpb=1.11044406 with int6 GPTQ + LZMA and sliding-window eval (stride=64).

Changes:

Adds the full submission bundle (training script, run log, metadata JSON, README, requirements) for 2026-04-10_s2048_4h_1xA100_1.1104.
Updates the training script for A100 environments (FA2/SDP attention fallback, deferred EMA start, Python 3.9 compatibility).
Records reported metrics, artifact sizes, and reproduction instructions.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/train_gpt.py	Training/eval/quantization script used to produce the submission artifact and metrics
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/train_seed1337.log	Captured run log with reported metrics and byte sizes
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/submission.json	Structured metadata for the submission (metrics, sizes, config)
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/README.md	Human-readable summary, numbers, and reproduction command
records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/requirements.txt	Minimal dependency list for reproducing the run

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-10T18:47:27Z

records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/train_gpt.py

+    seq_len = eval_seq_len or args.train_seq_len
+    total_tokens = val_tokens.numel() - 1
+    window_starts = [ws for ws in range(0, total_tokens, stride)
+                     if min(ws + seq_len, total_tokens) - ws >= 1]
+    total_windows = len(window_starts)


eval_val_sliding currently includes window starts all the way to total_tokens, which creates short tail windows (wlen < seq_len). For ws>0 these tail windows score tokens that were already scored by the last full window, slightly over-weighting the end of the validation set and contradicting the “every token scored exactly once” sliding-window definition used elsewhere (e.g. the TTT window_starts filter in this file). Consider restricting window_starts to full windows (ws <= total_tokens - seq_len) and/or filtering with wlen >= stride or ws == 0 to avoid double-counting.

Copilot · 2026-04-10T18:47:27Z

records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/train_gpt.py

+    log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}")
+    log0(f"attn_backend:{_ATTN_BACKEND} sdp_backends:cudnn=False flash=True mem_efficient=False math=False")
+    log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}")


The logged SDP backend flags are hard-coded (mem_efficient=False), but earlier you call enable_mem_efficient_sdp(True). This makes the run metadata in train_seed1337.log misleading; please either query the actual backend settings or update the log string to match what is enabled.

Copilot · 2026-04-10T18:47:28Z

records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/train_gpt.py

+try:
+    import zstandard
+    _COMPRESSOR = "zstd"
+except ImportError:
+    _COMPRESSOR = "zlib"
+import numpy as np


The optional zstandard import and _COMPRESSOR selection appear unused (no references elsewhere in this script), while the submission path always uses LZMA. This adds dead code and an unnecessary dependency surface; consider removing _COMPRESSOR/zstandard/zlib if they’re not meant to be toggled, or wire them into an actual compressor option.

Copilot · 2026-04-10T18:47:28Z

records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/submission.json

+      "val_bpb_int6_roundtrip": 1.13437381,
+      "artifact_bytes": 15941100,
+      "total_submission_bytes": 16040603,
+      "steps": 14065,
+      "step_avg_ms": 1023.86
+    }
+  },
+  "artifact_bytes_max": 15941100,
+  "bytes_total": 16040603,


artifact_bytes / artifact_bytes_max don’t match the sizes in the included train_seed1337.log. The log reports Serialized model int6+lzma: 15920436 bytes and Total submission size ...: 16040603 bytes, implying artifact_bytes should be 15920436 (and code bytes ~120167), not 15941100. Please recompute these fields from the actual final_model.int6.ptz and script size so metadata stays self-consistent.

Copilot · 2026-04-10T18:47:28Z

records/track_non_record_16mb/2026-04-10_s2048_4h_1xA100_1.1104/README.md

+| Peak memory | 16.3 GiB |
+| Model params | 26,993,756 |
+| Artifact bytes (int6+lzma) | 15,941,100 |
+| **Total (code + artifact)** | **16,040,603** (under 16 MiB = 16,777,216) |
+


The README’s artifact/total byte counts appear inconsistent with the included training log. train_seed1337.log reports Serialized model int6+lzma: 15920436 bytes and Code size: 120167 bytes (total 16040603), but this README lists Artifact bytes ... 15,941,100. Please update the README numbers to match the actual generated files (or regenerate the log/README from the same run) so readers can verify the 16 MiB constraint.

Replaces the earlier 1.1104 non-record submission with a much stronger result that reproduces the PR openai#1493 SOTA 1.0810 recipe on 1xA100 for 4h instead of the required 8xH100 for 10min. Key numbers (seed 1337): - Int6 Sliding Window: 1.07266 BPB (beats upstream SOTA 1.0827 by -0.0100) - Int6 + Legal TTT: 1.07035 BPB (beats upstream SOTA 1.0810 by -0.0107) - Pre-quant post-EMA: 1.07610 BPB - Steps trained: 6371 (wallclock capped at 4h) - Total submission: 16,019,227 bytes (under 16 MiB) This is the exact PR openai#1493 SOTA recipe (SP8192 + 3-layer recurrence + parallel residuals layer 7+ + QK-Gain 5.25 + MuonEq-R + SDClip GPTQ + Brotli + byte shuffle + legal score-first TTT) with three A100 adaptations: 1. FA3 -> PyTorch SDP fallback with manual GQA head-repeat (A100 doesn't support FA3) 2. Python 3.9 compatibility (removed zip(strict=True) and nested double-quoted f-strings) 3. GRAD_ACCUM_STEPS env override for single-GPU runs Three seeds of the same config ran (exp60, exp61, exp62). exp60/62 crashed in their own eval phase with a torch.compile recompile issue when creating a fresh GPT instance after training; the saved quantized artifacts were then evaluated successfully via a standalone eval_only.py script. exp62 (QK_GAIN_INIT=5.25, the exact SOTA record value) beat exp60/exp61 (QK_GAIN_INIT=5.0, the script default) consistently across quant/sliding/TTT metrics, matching the "monotonic improvement from 4.0 to 5.25" observation in the SOTA paper. Still single-seed; 3-seed mean is not yet run due to time constraints.

MatoTeziTanka · 2026-04-12T04:50:19Z

Community Review — Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT)

BPB: 1.07035 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1413 dexhunter pattern)

What I found in the code (head SHA c01e4dac462a, file records/track_non_record_16mb/2026-04-11_SP8192_SOTA_QK525_TTT_1.0704_1xA100/train_gpt.py):

The TTT path at line 356 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape of the current leaderboard's legal frontier (PR #1413 dexhunter, the 1.0828 SP8192 + QK-Gain 5 + Legal TTT entry — verified at its head SHA against the is_last_chunk + torch.no_grad() score-first accumulator pattern).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 12.31s, dim=512, layers=11, vocab=8192, code=49104 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 12.31s, dim=512, layers=11, vocab=8192, code=49104 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Copilot AI review requested due to automatic review settings April 10, 2026 18:42

Copilot AI reviewed Apr 10, 2026

View reviewed changes

xiehuanyi changed the title ~~Non-record: 11L s2048 4h on 1xA100 — 1.1104 BPB~~ Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT) Apr 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT)#1528

Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT)#1528
xiehuanyi wants to merge 2 commits intoopenai:mainfrom
xiehuanyi:submission/s2048-4h-a100-1.1104

xiehuanyi commented Apr 10, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

Copilot AI Apr 10, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xiehuanyi commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in this PR

Numbers (seed 1337)

Hardware equivalence

Comparison with exp60 / exp61 (same training config, different QK_gain)

Caveats

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Community Review — Non-record: SP8192 + SOTA recipe on 1xA100 — 1.07035 BPB (TTT)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xiehuanyi commented Apr 10, 2026 •

edited

Loading